Abstract

Compositional embedding models build a representation (or embedding) for a
linguistic structure based on its component word embeddings. We propose a
Feature-rich Compositional Embedding Model (FCM) for relation extraction that
is expressive, generalizes to new domains, and is easy to implement. The key
idea is to combine (unlexicalized) hand-crafted features with learned word
embeddings. The model directly tackles the difficulties met by traditional
compositional embedding models, such as handling arbitrary types of sentence
annotations and utilizing global information for composition. We test the
proposed model on two relation extraction tasks, and demonstrate that it
outperforms both previous compositional models and traditional feature-rich
models on the ACE 2005 relation extraction task and the SemEval 2010 relation
classification task. The combination of our model and a log-linear classifier
with hand-crafted features gives state-of-the-art results. We have made our
implementation available for general use (footnote 1).

1 Introduction 

Two common NLP feature types are lexical properties of words and unlexicalized
linguistic/structural interactions between words. Prior work on relation
extraction has extensively studied how to design such features by combining
discrete lexical properties (e.g. the identity of a word,

* Gormley and Yu contributed equally.

1 https://github.com/mgormley/pacaya 


its lemma, its morphological features) with aspects of a word's linguistic
context (e.g. whether it lies between two entities or on a dependency path
between them). While these features help learning, they make generalization to
unseen words difficult. An alternative approach to capturing lexical
information relies on continuous word embeddings (footnote 2), which are
representative of words but generalize to new words. Embedding features have
improved many tasks, including NER, chunking, dependency parsing, semantic role
labeling, and relation extraction (Miller et al., 2004; Turian et al., 2010;
Koo et al., 2008; Roth and Woodsend, 2014; Sun et al., 2011; Plank and
Moschitti, 2013; Nguyen and Grishman, 2014). Embeddings can capture lexical
information, but alone they are insufficient: in state-of-the-art systems, they
are used alongside features of the broader linguistic context.

In this paper, we introduce a compositional model that combines unlexicalized
linguistic context and word embeddings for relation extraction, a task in which
contextual feature construction plays a major role in generalizing to unseen
data. Our model allows for the composition of embeddings with arbitrary
linguistic structure, as expressed by hand-crafted features. In the following
sections, we begin with a precise construction of compositional embeddings
using word embeddings in conjunction with unlexicalized features. Various
feature sets used in prior work (Turian et al., 2010; Nguyen and Grishman,
2014; Hermann
et al., 2014; Roth and Woodsend, 2014) are cap- 

2 Such embeddings have a long history in NLP, including term-document frequency
matrices and their low-dimensional counterparts obtained by linear algebra
tools (LSA, PCA, CCA, NNMF), Brown clusters, random projections and vector
space models. Recently, neural networks / deep learning have provided several
popular methods for obtaining such embeddings.



Class                      M_1                    M_2          Sentence Snippet
(1) ART(M_1, M_2)          a man                  a taxicab    A man driving what appeared to be a taxicab
(2) PART-WHOLE(M_1, M_2)   the southern suburbs   Baghdad      direction of the southern suburbs of Baghdad
(3) PHYSICAL(M_2, M_1)     the united states      284 people   in the united states , 284 people died


Table 1: Examples from ACE 2005. In (1) the word “driving” is a strong indicator of the relation ART (footnote 3) between $M_1$ and $M_2$.
A feature that depends on the embedding for this context word could generalize to other lexical indicators of the same relation
(e.g. “operating”) that don’t appear with ART during training. But lexical information alone is insufficient; relation extraction
requires the identification of lexical roles: where a word appears structurally in the sentence. In (2), the word “of” between
“suburbs” and “Baghdad” suggests that the first entity is part of the second, yet the earlier occurrence after “direction” is of no
significance to the relation. Even finer information can be expressed by a word’s role on the dependency path between entities.
In (3) we can distinguish the word “died” from other irrelevant words that don’t appear between the entities.


tured as special cases of this construction. Adding 
these compositional embeddings directly to a standard log-linear model yields a
special case of our full model. We then treat the word embeddings as
parameters, giving rise to our powerful, efficient, and easy-to-implement
log-bilinear model. The model capitalizes on arbitrary types of linguistic
annotations by better utilizing features associated with substructures of those
annotations, including global information. We choose features to promote
different properties and to distinguish different functions of the input words.

The full model involves three stages. First, it decomposes the annotated
sentence into substructures (i.e. a word and associated annotations). Second,
it extracts features for each substructure (word), and combines them with the
word’s embedding to form a substructure embedding. Third, it sums over
substructure embeddings to form a composed annotated sentence embedding, which
is used by a final softmax layer to predict the output label (relation).

The result is a state-of-the-art relation extractor 
for unseen domains from ACE 2005 (Walker et al., 
2006) and the relation classification dataset from 
SemEval-2010 Task 8 (Hendrickx et al., 2010). 

Contributions This paper makes several contributions, including:

1. We introduce the FCM, a new compositional embedding model for relation
extraction.

2. We obtain the best reported results on ACE-2005 for coarse-grained relation
extraction in the cross-domain setting, by combining FCM with a log-linear
model.

3. We obtain results on SemEval-2010 Task 8 competitive with the best reported
results.

Note that other work has already been published that builds on the FCM, such as
Hashimoto et al. (2015), Nguyen and Grishman (2015), dos Santos
3 In ACE 2005, ART refers to a relation between a person and an artifact, such
as a user, owner, inventor, or manufacturer relationship.


et al. (2015), Yu and Dredze (2015) and Yu et al. 
(2015). Additionally, we have extended FCM to 
incorporate a low-rank embedding of the features 
(Yu et al., 2015), which focuses on fine-grained 
relation extraction for ACE and ERE. This paper 
obtains better results than the low-rank extension 
on ACE coarse-grained relation extraction. 

2 Relation Extraction 

In relation extraction we are given a sentence as input with the goal of
identifying, for all pairs of entity mentions, what relation exists between
them, if any. For each pair of entity mentions in a sentence $S$, we construct
an instance $(y, x)$, where $x = (M_1, M_2, S, A)$.
$S = \{w_1, w_2, \ldots, w_n\}$ is a sentence of length $n$ that expresses a
relation of type $y$ between two entity mentions $M_1$ and $M_2$, where $M_1$
and $M_2$ are sequences of words in $S$. $A$ is the associated annotations of
sentence $S$, such as part-of-speech tags, a dependency parse, and named
entities. We consider directed relations: for a relation type $Rel$,
$y = Rel(M_1, M_2)$ and $y' = Rel(M_2, M_1)$ are different relations. Table 1
shows example ACE 2005 relations; ACE 2005 has a strong label bias towards
negative examples. We also consider the task of relation classification
(SemEval), where the number of negative examples is artificially reduced.

Embedding Models Word embeddings and compositional embedding models have been
successfully applied to a range of NLP tasks; however, the applications of
these embedding models to relation extraction are still limited. Prior work on
relation classification (e.g. SemEval 2010 Task 8) has focused on short
sentences with at most one relation per sentence (Socher et al., 2012; Zeng et
al., 2014). For relation extraction, where negative examples abound, prior work
has assumed that only the named entity boundaries and not their types were
available (Plank and Moschitti, 2013; Nguyen et al., 2015). Other work has
assumed that the order of two entities in a relation is given while the
relation type itself is unknown (Nguyen and Grishman, 2014; Nguyen and
Grishman, 2015). The standard relation extraction task, as adopted by ACE 2005
(Walker et al., 2006), uses long sentences containing multiple named entities
with known types (footnote 4) and unknown relation directions. We are the first
to apply neural language model embeddings to this task.

Motivation and Examples Whether a word is indicative of a relation depends on
multiple properties, which may relate to its context within the sentence. For
example, whether the word is in-between the entities, on the dependency path
between them, or to their left or right may provide additional complementary
information. Illustrative examples are given in Table 1 and provide the
motivation for our model. In the next section, we will show how we develop
informative representations capturing both the semantic information in word
embeddings and the contextual information expressing a word’s role relative to
the entity mentions. We are the first to incorporate all of this information at
once. The closest work is that of Nguyen and Grishman (2014), who use a
log-linear model for relation extraction with embeddings as features for only
the entity heads. Such embedding features are insensitive to the broader
contextual information and, as we show, are not sufficient to elicit the word’s
role in a relation.

3 A Feature-rich Compositional 
Embedding Model for Relations 

We propose a general framework to construct an embedding of a sentence with
annotations on its component words. While we focus on the relation extraction
task, the framework applies to any task that benefits from both embeddings and
typical hand-engineered lexical features.

3.1 Combining Features with Embeddings 

We begin by describing a precise method for constructing substructure
embeddings and annotated sentence embeddings from existing (usually
unlexicalized) features and embeddings. Note that these embeddings can be
included directly in a log-linear model as features; doing so results in

4 Since the focus of this paper is relation extraction, we 
adopt the evaluation setting of prior work which uses gold 
named entities to better facilitate comparison. 


a special case of our full model presented in the next subsection.

An annotated sentence is first decomposed into substructures. The type of
substructures can vary by task; for relation extraction we consider one
substructure per word (footnote 5). For each substructure in the sentence we
have a hand-crafted feature vector $f_{w_i}$ and a dense embedding vector
$e_{w_i}$. We represent each substructure as the outer product $\otimes$
between these two vectors to produce a matrix, herein called a substructure
embedding: $h_{w_i} = f_{w_i} \otimes e_{w_i}$. The features $f_{w_i}$ are
based on the local context in $S$ and annotations in $A$, which can include
global information about the annotated sentence. These features allow the model
to promote different properties and to distinguish different functions of the
words. Feature engineering can be task specific, as relevant annotations can
change with regard to each task. In this work we utilize unlexicalized binary
features common in relation extraction. Figure 1 depicts the construction of a
sentence’s substructure embeddings.

We further sum over the substructure embeddings to form an annotated sentence
embedding:

$$e_x = \sum_{i=1}^{n} f_{w_i} \otimes e_{w_i} \qquad (1)$$

When both the hand-crafted features and word embeddings are treated as inputs,
as has previously been the case in relation extraction, this annotated sentence
embedding can be used directly as the features of a log-linear model. In fact,
we find that the feature sets used in prior work for many other NLP tasks are
special cases of this simple construction (Turian et al., 2010; Nguyen and
Grishman, 2014; Hermann et al., 2014; Roth and Woodsend, 2014). This highlights
an important connection: when the word embeddings are constant, our
constructions of substructure and annotated sentence embeddings are just
specific forms of polynomial (specifically quadratic) feature combination,
hence their commonality in the literature. Our experimental results suggest
that such a construction is more powerful than directly including embeddings
into the model.
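
As a rough illustration of the construction in this subsection, the following
sketch builds substructure embeddings as outer products and sums them into an
annotated sentence embedding. The feature meanings, the toy values, and the
function names are assumptions made for the example; plain Python lists are
used only to keep the sketch dependency-free.

```python
def outer(f, e):
    """Outer product of a feature vector f and an embedding e (a list of rows)."""
    return [[fj * ek for ek in e] for fj in f]


def sentence_embedding(features, embeddings):
    """Sum of the per-word outer products, as in Eq. 1."""
    m, d = len(features[0]), len(embeddings[0])
    total = [[0.0] * d for _ in range(m)]
    for f, e in zip(features, embeddings):
        h = outer(f, e)                     # substructure embedding of one word
        for j in range(m):
            for k in range(d):
                total[j][k] += h[j][k]
    return total


# Toy sentence with 3 words, 2 binary features per word, and 2-d embeddings.
# Feature 0 / feature 1 could stand for "on-path" / "in-between" (hypothetical).
features = [[1, 0], [1, 1], [0, 0]]
embeddings = [[0.5, -1.0], [2.0, 0.25], [3.0, 3.0]]

e_x = sentence_embedding(features, embeddings)
print(e_x)
```

Row j of the result is the sum of the embeddings of the words for which
feature j fired, so the third word, with no active features, contributes
nothing.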

3.2 The Log-Bilinear Model 

Our full log-bilinear model first forms the substructure and annotated
sentence embeddings from

5 We use words as substructures for relation extraction, but 
use the general terminology to maintain model generality. 




Figure 1: Example construction of substructure embeddings. Each substructure is a word $w_i$ in $S$, augmented by the target
entity information and related information from annotation $A$ (e.g. a dependency tree). We show the factorization of the
annotated sentence into substructures (left), the concatenation of the substructure embeddings for the sentence (middle), and a
single substructure embedding from that concatenation (right). The annotated sentence embedding (not shown) would be the
sum of the substructure embeddings, as opposed to their concatenation.


the previous subsection. The model uses its parameters to score the annotated
sentence embedding and uses a softmax to produce an output label. We call the
entire model the Feature-rich Compositional Embedding Model (FCM).

Our task is to determine the label $y$ (relation) given the instance
$x = (M_1, M_2, S, A)$. We formulate this as a probability:

$$P(y \mid x; T, e) = \frac{\exp\left(\sum_{i=1}^{n} T_y \odot (f_{w_i} \otimes e_{w_i})\right)}{Z(x)} \qquad (2)$$

where $\odot$ is the ‘matrix dot product’ or Frobenius inner product of the two
matrices. The normalizing constant, which sums over all possible output labels
$y' \in L$, is given by
$Z(x) = \sum_{y' \in L} \exp\left(\sum_{i=1}^{n} T_{y'} \odot (f_{w_i} \otimes e_{w_i})\right)$.
The parameters of the model are the word embeddings $e$ for each word type and
a list of weight matrices $T = [T_y]_{y \in L}$, which is used to score each
label $y$. The model is log-bilinear (footnote 6) (i.e. log-quadratic) since we
recover a log-linear model by fixing either $e$ or $T$. We study both the full
log-bilinear and the log-linear model obtained by fixing the word embeddings.
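
The scoring rule of Eq. 2 can be sketched as follows. This is a minimal
illustration rather than the released implementation: the function names, the
two-label setup, and all numeric values are assumptions chosen for the example.

```python
import math


def label_scores(T, features, embeddings):
    """Score of each label: sum_i T_y (Frobenius) (f_i outer e_i) = sum_i f_i^T T_y e_i."""
    scores = []
    for T_y in T:
        s = 0.0
        for f, e in zip(features, embeddings):
            s += sum(f[j] * T_y[j][k] * e[k]
                     for j in range(len(f)) for k in range(len(e)))
        scores.append(s)
    return scores


def fcm_probs(T, features, embeddings):
    """Softmax over the label scores (Eq. 2), shifted by the max for stability."""
    scores = label_scores(T, features, embeddings)
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [v / z for v in exps]


# Two labels, 2 features, 2-d embeddings (all values hypothetical).
T = [
    [[0.0, 0.0], [0.0, 0.0]],   # label 0: all-zero weights
    [[1.0, 0.0], [0.0, 0.0]],   # label 1: rewards feature 0 with dimension 0
]
features = [[1, 0], [0, 1]]
embeddings = [[2.0, 0.0], [5.0, 5.0]]

probs = fcm_probs(T, features, embeddings)
print(probs)
```

Only the first word has feature 0 active, so only its embedding reaches the
non-zero row of the label-1 matrix; the second word's large embedding is
ignored by that row.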


3.3 Discussion of the Model 

Substructure Embeddings Similar words (i.e. those with similar embeddings) with
similar functions in the sentence (i.e. those with similar features) will have
similar matrix representations. To understand our selection of the outer
product, consider the example in Fig. 1. The word “driving” can indicate the
ART relation if it appears on the
6 Other popular log-bilinear models are the log-bilinear language models (Mnih
and Hinton, 2007; Mikolov et al., 2013).


dependency path between $M_1$ and $M_2$. Suppose the third feature in $f_{w_i}$
indicates this on-path feature. Our model can now learn parameters which give
the third row a high weight for the ART label. Other words with embeddings
similar to “driving” that appear on the dependency path between the mentions
will similarly receive high weight for the ART label. On the other hand, if a
word has a similar embedding but is not on the dependency path, it will have 0
weight. Thus, our model generalizes its model parameters across words with
similar embeddings only when they share similar functions in the sentence.

Smoothed Lexical Features Another intuition about the selection of outer
product is that it is actually a smoothed version of traditional lexical
features used in classical NLP systems. Consider a lexical feature
$f = u \wedge w$, which is a conjunction (logic-and) between non-lexical
property $u$ and lexical part (word) $w$. If we represent $w$ as a one-hot
vector, then the outer product exactly recovers the original feature $f$. Then
if we replace the one-hot representation with its word embedding, we get the
current form of our FCM. Therefore, our model can be viewed as a smoothed
version of lexical features, which keeps the expressive strength, and uses
embeddings to generalize to low frequency features.
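
The correspondence with lexical features can be checked on a toy example. In
the sketch below, the three-word vocabulary, the position of the on-path
feature, and the dense embedding values are all assumptions made for
illustration.

```python
def outer(f, e):
    """Outer product of a feature vector f and a word vector e (a list of rows)."""
    return [[fj * ek for ek in e] for fj in f]


vocab = ["driving", "operating", "of"]          # hypothetical toy vocabulary


def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]


# Unlexicalized features of one word; index 2 plays the role of "on-path".
u = [0.0, 0.0, 1.0]

# With a one-hot word vector, the only non-zero cell is [2][0]:
# the classical conjunction feature "on-path AND word = driving".
lexical = outer(u, one_hot("driving"))

# With dense embeddings (hypothetical values in which "driving" and
# "operating" are close), the same row carries a smoothed version of it.
emb = {"driving": [0.9, 0.1], "operating": [0.8, 0.2], "of": [-0.7, 0.6]}
smoothed_driving = outer(u, emb["driving"])
smoothed_operating = outer(u, emb["operating"])

print(lexical)
print(smoothed_driving[2], smoothed_operating[2])
```

A weight learned for the on-path row from “driving” therefore also applies,
to a similar degree, to “operating” when it appears on the path, even if that
conjunction was never seen in training.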

Time Complexity Inference in FCM is much faster than both CNNs (Collobert et
al., 2011) and RNNs (Socher et al., 2013b; Bordes et al., 2012). FCM requires
$O(snd)$ products on average with sparse features, where $s$ is the average
number of per-word non-zero feature values, $n$ is the length of the sentence,
and $d$ is the dimension of word embedding. In contrast, CNNs and RNNs usually
have complexity $O(C \cdot nd^2)$, where $C$ is a model dependent constant.
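
One way to see where the $O(snd)$ count comes from is to score a single label
while visiting only the active binary features of each word. The sketch below
is an illustration under assumed toy values; the function names and the
explicit multiplication counter are not part of the paper.

```python
def dense_score(T_y, feats, embeds):
    """Dense evaluation: touches every (feature, dimension) cell for every word."""
    return sum(f[j] * T_y[j][k] * e[k]
               for f, e in zip(feats, embeds)
               for j in range(len(f)) for k in range(len(e)))


def sparse_score(T_y, active, embeds):
    """Sparse evaluation: only rows of T_y whose binary feature fired are read.

    Returns the score for one label and the number of multiplications used.
    """
    score, mults = 0.0, 0
    for idxs, e in zip(active, embeds):      # n words
        for j in idxs:                       # about s active features per word
            row = T_y[j]
            score += sum(row[k] * e[k] for k in range(len(e)))   # d products
            mults += len(e)
    return score, mults


# Toy values (hypothetical): 4 features, 3 dimensions, 2 words.
T_y = [[0.5, -1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 2.0, -0.5], [0.25, 0.0, 0.0]]
embeds = [[1.0, 2.0, 3.0], [-1.0, 0.5, 4.0]]
active = [[0, 2], [1]]                       # indices of non-zero features
feats = [[1 if j in idxs else 0 for j in range(len(T_y))] for idxs in active]

sparse, n_mults = sparse_score(T_y, active, embeds)
dense = dense_score(T_y, feats, embeds)
print(sparse, dense, n_mults)
```

Here three features are active in total and $d = 3$, so the sparse path uses 9
products, independent of how many feature templates exist.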

4 Hybrid Model 

We present a hybrid model which combines the FCM with an existing log-linear
model. We do so by defining a new model:

$$p_{\text{FCM+loglin}}(y \mid x) = \frac{1}{Z} \, p_{\text{FCM}}(y \mid x) \, p_{\text{loglin}}(y \mid x) \qquad (3)$$

The log-linear model has the usual form:
$p_{\text{loglin}}(y \mid x) \propto \exp(\theta \cdot f(x, y))$, where $\theta$
are the model parameters and $f(x, y)$ is a vector of features. The integration
treats each model as providing a score, and the two scores are multiplied
together. The constant $Z$ ensures a normalized distribution.
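
The combination in Eq. 3 amounts to a per-label product followed by
renormalization. The following sketch uses made-up distributions over three
labels; the function name and values are assumptions for illustration only.

```python
def hybrid(p_fcm, p_loglin):
    """Multiply the per-label scores of the two models and renormalize (Eq. 3)."""
    prod = [a * b for a, b in zip(p_fcm, p_loglin)]
    z = sum(prod)                            # the constant Z
    return [v / z for v in prod]


# Hypothetical per-label outputs of the two sub-models for one instance.
p_fcm = [0.6, 0.3, 0.1]
p_loglin = [0.2, 0.7, 0.1]

p_hybrid = hybrid(p_fcm, p_loglin)
best = max(range(len(p_hybrid)), key=p_hybrid.__getitem__)

# Because of the renormalization, an unnormalized log-linear score
# (any positive rescaling of p_loglin) gives the same combined distribution.
p_rescaled = hybrid(p_fcm, [10.0 * v for v in p_loglin])
print(p_hybrid, best)
```

In this example the FCM alone prefers label 0, but the product prefers label 1
because the log-linear model supports it strongly.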

5 Training 

FCM training optimizes a cross-entropy objective:

$$\ell(D; T, e) = \sum_{(x, y) \in D} \log P(y \mid x; T, e)$$

where $D$ is the set of all training data and $e$ is the set of word
embeddings. To optimize the objective, for each instance $(y, x)$ we perform
stochastic training on the loss function
$\ell = \ell(y, x; T, e) = \log P(y \mid x; T, e)$. The gradients of the model
parameters are obtained by backpropagation (i.e. repeated application of the
chain rule). We define the vector
$s = \left[\sum_i T_{y'} \odot (f_{w_i} \otimes e_{w_i})\right]_{1 \le y' \le L}$,
which yields

$$\frac{\partial \ell}{\partial s} = \left[ \left( I[y = y'] - P(y' \mid x; T, e) \right)_{1 \le y' \le L} \right]^{\top},$$

where the indicator function $I[x]$ equals 1 if $x$ is true and 0 otherwise. We
have the following gradients:
$\frac{\partial \ell}{\partial T} = \frac{\partial \ell}{\partial s} \otimes \sum_{i=1}^{n} f_{w_i} \otimes e_{w_i}$,
which is equivalent to:

$$\frac{\partial \ell}{\partial T_{y'}} = \left( I[y = y'] - P(y' \mid x; T, e) \right) \cdot \sum_{i=1}^{n} f_{w_i} \otimes e_{w_i}.$$


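When we treat the word embeddings as parameters (i.e. the log-bilinear model),
we also fine-tune the word embeddings with the FCM model. Applying the chain
rule to Eq. 2, the gradient for the embedding of word type $w$ is:

$$\frac{\partial \ell}{\partial e_w} = \sum_{i=1}^{n} I[w_i = w] \left( \sum_{y'} \frac{\partial \ell}{\partial s_{y'}} T_{y'} \right)^{\top} f_{w_i}$$
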

As is common in deep learning, we initialize these embeddings from a neural
language model and then fine-tune them for our supervised task. The training
process for the hybrid model (§ 4) is also easily done by backpropagation since
each sub-model has separate parameters.
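
The gradients of this section can be sanity-checked numerically. The sketch
below is an illustration, not the released training code: it computes
$\log P(y \mid x; T, e)$ for a toy instance, derives the gradients with respect
to each $T_{y'}$ and each word embedding by the chain rule, and compares two
entries against central finite differences. All names and values are
assumptions; the word at positions 0 and 2 is deliberately the same type so
that its embedding gradient sums over both positions.

```python
import math


def probs(T, E, feats, word_ids):
    """Softmax over label scores s[y] = sum_i f_i^T T_y e_{w_i} (Eq. 2)."""
    s = [sum(f[j] * T_y[j][k] * E[w][k]
             for f, w in zip(feats, word_ids)
             for j in range(len(f)) for k in range(len(E[w])))
         for T_y in T]
    mx = max(s)
    ex = [math.exp(v - mx) for v in s]
    z = sum(ex)
    return [v / z for v in ex]


def log_prob(T, E, feats, word_ids, y):
    return math.log(probs(T, E, feats, word_ids)[y])


def gradients(T, E, feats, word_ids, y):
    """Analytic gradients of log P(y|x) w.r.t. every T_y' and every embedding."""
    n_labels, m, d = len(T), len(T[0]), len(T[0][0])
    p = probs(T, E, feats, word_ids)
    err = [(1.0 if yp == y else 0.0) - p[yp] for yp in range(n_labels)]
    # Annotated sentence embedding: sum_i f_i (outer) e_{w_i}.
    e_x = [[sum(f[j] * E[w][k] for f, w in zip(feats, word_ids))
            for k in range(d)] for j in range(m)]
    dT = [[[err[yp] * e_x[j][k] for k in range(d)] for j in range(m)]
          for yp in range(n_labels)]
    # Position i adds (sum_y' err[y'] T_y')^T f_i to the gradient of its word type.
    dE = [[0.0] * d for _ in E]
    for f, w in zip(feats, word_ids):
        for k in range(d):
            dE[w][k] += sum(err[yp] * T[yp][j][k] * f[j]
                            for yp in range(n_labels) for j in range(m))
    return dT, dE


def numeric_grad(cell, k, loss, eps=1e-6):
    """Central finite difference of loss() with respect to the float cell[k]."""
    old = cell[k]
    cell[k] = old + eps
    up = loss()
    cell[k] = old - eps
    down = loss()
    cell[k] = old
    return (up - down) / (2 * eps)


# Toy instance (hypothetical): 2 labels, 2 features, 2-d embeddings, 2 word types.
T = [[[0.1, -0.2], [0.3, 0.0]], [[-0.4, 0.5], [0.2, 0.1]]]
E = [[1.0, 0.5], [-0.5, 2.0]]
feats = [[1, 0], [1, 1], [0, 1]]
word_ids = [0, 1, 0]
y = 1

dT, dE = gradients(T, E, feats, word_ids, y)
loss = lambda: log_prob(T, E, feats, word_ids, y)
num_dT = numeric_grad(T[1][0], 1, loss)
num_dE = numeric_grad(E[0], 1, loss)
print(dT[1][0][1], num_dT)
print(dE[0][1], num_dE)
```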


Set          Template
HeadEmb      $\{I[i = h_1], I[i = h_2]\}$ ($w_i$ is head of $M_1$/$M_2$) $\times \{\phi, t_{h_1}, t_{h_2}, t_{h_1} \oplus t_{h_2}\}$
Context      $I[i = h_1 \pm 1]$ (left/right token of $w_{h_1}$)
             $I[i = h_2 \pm 1]$ (left/right token of $w_{h_2}$)
In-between   $I[i > h_1]$ & $I[i < h_2]$ (in between)
On-path      $I[w_i \in P]$ (on path)
             $\times \{\phi, t_{h_1}, t_{h_2}, t_{h_1} \oplus t_{h_2}\}$


Table 2: Feature sets used in FCM. 

6 Experimental Settings 

Features Our FCM features (Table 2) use a feature vector $f_{w_i}$ over the
word $w_i$, the two target entities $M_1$, $M_2$, and their dependency path.
Here $h_1$, $h_2$ are the indices of the two head words of $M_1$, $M_2$,
$\times$ refers to the Cartesian product between two sets, $t_{h_1}$ and
$t_{h_2}$ are entity types (named entity tags for ACE 2005 or WordNet supertags
for SemEval 2010) of the head words of two entities, and $\phi$ stands for the
empty feature. $\oplus$ refers to the conjunction of two elements. The
In-between features indicate whether a word $w_i$ is in between two target
entities, and the On-path features indicate whether the word is on the
dependency path, on which there is a set of words $P$, between the two
entities.
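
To make the templates concrete, here is a small sketch that emits the active
binary features of one word as strings. The feature names, the type labels
`PER` and `VEH`, the split of the Context template into separate left and right
features, and the dependency path are all assumptions for illustration; the
sketch applies the entity-type conjunctions to the HeadEmb and On-path
templates only, and extending them to In-between would be a one-line change.

```python
def fcm_features(i, h1, h2, path, t1, t2):
    """Active binary features for the word at index i (hypothetical naming).

    h1, h2: head indices of the two mentions (assumes h1 < h2);
    path:   set of word indices on the dependency path between the heads;
    t1, t2: entity types of the two heads.
    """
    # Empty feature, each type alone, and the conjunction of both types.
    type_suffixes = ["", "|" + t1, "|" + t2, "|" + t1 + "+" + t2]
    feats = set()

    def add_with_types(name):
        feats.update(name + suffix for suffix in type_suffixes)

    if i == h1:
        add_with_types("head_of_M1")
    if i == h2:
        add_with_types("head_of_M2")
    for head, tag in ((h1, "h1"), (h2, "h2")):
        if i == head - 1:
            feats.add("left_of_" + tag)
        if i == head + 1:
            feats.add("right_of_" + tag)
    if h1 < i < h2:
        feats.add("in_between")
    if i in path:
        add_with_types("on_path")
    return feats


tokens = "A man driving what appeared to be a taxicab".split()
h1, h2 = 1, 8                 # heads "man" and "taxicab"
path = {1, 2, 8}              # hypothetical path, not produced by a real parser

driving_feats = fcm_features(2, h1, h2, path, "PER", "VEH")
print(sorted(driving_feats))
```

Each returned string would index one row of the feature vector $f_{w_i}$, so a
word's substructure embedding has its word embedding copied into exactly those
rows.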

We also use the target entity type as a feature. Combining this with the basic
features results in more powerful compound features, which can help us better
distinguish the functions of word embeddings for predicting certain relations.
For example, if we have a person and a vehicle, we know it will be more likely
that they have an ART relation. For the ART relation, we introduce a
corresponding weight vector, which is closer to lexical embeddings similar to
the embedding of “drive”.

All linguistic annotations needed for features (POS, chunks (footnote 7),
parses) are from Stanford CoreNLP (Manning et al., 2014). Since SemEval does
not have gold entity types we obtained WordNet and named entity tags using
Ciaramita and Altun (2006). For all experiments we use 200-d word embeddings
trained on the NYT portion of the Gigaword 5.0 corpus (Parker et al., 2011),
with word2vec (Mikolov et al., 2013). We use the CBOW model with negative
sampling (15 negative words). We set a window size c=5, and remove types
occurring fewer than 5 times.

Models We consider several methods. (1) FCM in isolation without fine-tuning.
(2) FCM in isolation with fine-tuning (i.e. trained as a log-bilinear

7 Obtained from the constituency parse using the CONLL 
2000 chunking converter (Perl script). 




model). (3) A log-linear model with a rich binary feature set from Sun et al.
(2011) (Baseline); this consists of all the baseline features of Zhou et al.
(2005) plus several additional carefully-chosen features that have been highly
tuned for ACE-style relation extraction over years of research. We exclude the
Country gazetteer and WordNet features from Zhou et al. (2005). The two
remaining methods are hybrid models that integrate FCM as a sub-model within
the log-linear model (§ 4). We consider two combinations. (4) The feature set
of Nguyen and Grishman (2014) obtained by using the embeddings of heads of two
entity mentions (+HeadOnly). (5) Our full FCM model (+FCM). All models use L2
regularization tuned on dev data.

6.1 Datasets and Evaluation 

ACE 2005 We evaluate our relation extraction system on the English portion of
the ACE 2005 corpus (Walker et al., 2006) (footnote 8). There are 6 domains:
Newswire (nw), Broadcast Conversation (bc), Broadcast News (bn), Telephone
Speech (cts), Usenet Newsgroups (un), and Weblogs (wl). Following prior work we
focus on the domain adaptation setting, where we train on one set (the union of
the news domains, bn+nw), tune hyperparameters on a dev domain (half of bc),
and evaluate on the remainder (cts, wl, and the remainder of bc) (Plank and
Moschitti, 2013; Nguyen and Grishman, 2014). We assume that gold entity spans
and types are available for train and test. We use all pairs of entity mentions
to yield 43,518 total relations in the training set. We report precision,
recall, and F1 for relation extraction. While it is not our focus, for
completeness we include results with unknown entity types following Plank and
Moschitti (2013) (Appendix 1).

SemEval 2010 Task 8 We evaluate on the SemEval 2010 Task 8 dataset (footnote 9)
(Hendrickx et al., 2010) to compare with other compositional models and
highlight the advantages of FCM. This task is to determine the relation type
(or no relation) between two entities in a sentence. We adopt the setting of
Socher et al. (2012). We use 10-fold

8 Many relation extraction systems evaluate on the ACE 2004 corpus (Mitchell et
al., 2005). Unfortunately, the most common convention is to use 5-fold cross
validation, treating the entirety of the dataset as both train and evaluation
data. Rather than continuing to overfit this data by perpetuating the
cross-validation convention, we instead focus on ACE 2005.

9 http://docs.google.com/View?docid=dfvxd49s_36c28v9pmw


cross validation on the training data to select hyperparameters and do
regularization by early stopping. The learning rates for FCM with/without
fine-tuning are 5e-3 and 5e-2 respectively. We report macro-F1 and compare to
previously published results.
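
For reference, per-class precision, recall, and F1, and their macro average,
can be computed as in the sketch below. This is a simplified illustration with
made-up labels; the official SemEval scorer applies additional conventions
(for example around the Other class and relation direction) that are not
reproduced here.

```python
def prf(gold, pred, label):
    """Precision, recall, and F1 of one class from aligned gold/predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1 scores over the given labels."""
    return sum(prf(gold, pred, label)[2] for label in labels) / len(labels)


# Hypothetical predictions over two relation classes plus a negative class.
gold = ["A", "A", "B", "Other"]
pred = ["A", "B", "B", "A"]

score = macro_f1(gold, pred, labels=["A", "B"])   # negative class left out
print(prf(gold, pred, "A"), prf(gold, pred, "B"), score)
```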

7 Results 

ACE 2005 Despite FCM’s (1) simple feature set, it is competitive with the
log-linear baseline (3) on out-of-domain test sets (Table 3). In the typical
gold entity spans and types setting, both Plank and Moschitti (2013) and Nguyen
and Grishman (2014) found that they were unable to obtain improvements by
adding embeddings to baseline feature sets. By contrast, we find that on all
domains the combination baseline + FCM (5) obtains the highest F1 and
significantly outperforms the other baselines, yielding the best reported
results for this task. We found that fine-tuning of embeddings (2) did not
yield improvements on our out-of-domain development set, in contrast to our
results below for SemEval. We suspect this is because fine-tuning allows the
model to overfit the training domain, which then hurts performance on the
unseen ACE test domains. Accordingly, Table 3 shows only the log-linear model.

Finally, we highlight an important contrast between FCM (1) and the log-linear
model (3): the latter uses over 50 feature templates based on a POS tagger,
dependency parser, chunker, and constituency parser. FCM uses only a dependency
parse but still obtains better results (Avg. F1).

SemEval 2010 Task 8 Table 4 shows FCM compared to the best reported results
from the SemEval-2010 Task 8 shared task and several other compositional
models.

For the FCM we considered two feature sets. We found that using NE tags instead
of WordNet tags helps with fine-tuning but hurts without. This may be because
the set of WordNet tags is larger, making the model more expressive but also
introducing more parameters. When the embeddings are fixed, they can help to
better distinguish different functions of embeddings. But when fine-tuning, it
becomes easier to over-fit. Alleviating over-fitting is a subject for future
work (§ 9).

With either WordNet or NER features, FCM 
achieves better performance than the RNN and 
MVRNN. With NER features and fine-tuning, it 
outperforms a CNN (Zeng et al., 2014) and also 



                        bc                       cts                      wl                       Avg.
Model                   P       R       F1       P       R       F1       P       R       F1       F1
(1) FCM only (ST)       66.56   57.86   61.90    65.62   44.35   52.93    57.80   44.62   50.36    55.06
(3) Baseline (ST)       74.89   48.54   58.90    74.32   40.26   52.23    63.41   43.20   51.39    54.17
(4) + HeadOnly (ST)     70.87   50.76   59.16    71.16   43.21   53.77    57.71   42.92   49.23    54.05
(5) + FCM (ST)          74.39   55.35   63.48    74.53   45.01   56.12    65.63   47.59   55.17    58.26


Table 3: Comparison of models on ACE 2005 out-of-domain test sets. Baseline + HeadOnly is our 
reimplementation of the features of Nguyen and Grishman (2014). 


Classifier                                Features                                                 F1
SVM (Rink and Harabagiu, 2010)            POS, prefixes, morphological, WordNet, dependency        82.2
(Best in SemEval2010)                     parse, Levin classes, PropBank, FrameNet, NomLex-Plus,
                                          Google n-gram, paraphrases, TextRunner
RNN                                       word embedding, syntactic parse                          74.8
RNN + linear                              word embedding, syntactic parse, POS, NER, WordNet       77.6
MVRNN                                     word embedding, syntactic parse                          79.1
MVRNN + linear                            word embedding, syntactic parse, POS, NER, WordNet       82.4
CNN (Zeng et al., 2014)                   word embedding, WordNet                                  82.7
CR-CNN (log-loss)                         word embedding                                           82.7
CR-CNN (ranking-loss)                     word embedding                                           84.1
RelEmb (word2vec embedding)               word embedding                                           81.8
RelEmb (task-spec embedding)              word embedding                                           82.8
RelEmb (task-spec embedding) + linear     word embedding, dependency paths, WordNet, NE            83.5
DepNN                                     word embedding, dependency paths                         82.8
DepNN + linear                            word embedding, dependency paths, WordNet, NER           83.6
(1) FCM (log-linear)                      word embedding, dependency parse, WordNet                82.0
                                          word embedding, dependency parse, NER                    81.4
(2) FCM (log-bilinear)                    word embedding, dependency parse, WordNet                82.5
                                          word embedding, dependency parse, NER                    83.0
(5) FCM (log-linear) + linear (Hybrid)    word embedding, dependency parse, WordNet                83.1
                                          word embedding, dependency parse, NER                    83.4


Table 4: Comparison of FCM with previously published results for SemEval 2010 Task 8. 


the combination of an embedding model and a traditional log-linear model
(RNN/MVRNN + linear) (Socher et al., 2012). As with ACE, FCM uses fewer
linguistic resources than many close competitors (Rink and Harabagiu, 2010).

We also compared to concurrent work on enhancing the compositional models with
task-specific information for relation classification, including Hashimoto et
al. (2015) (RelEmb), which trained task-specific word embeddings, and dos
Santos et al. (2015) (CR-CNN), which proposed a task-specific ranking-based
loss function. Our Hybrid methods (FCM + linear) get comparable results to
theirs. Note that their base compositional model results without any
task-specific enhancements, i.e. RelEmb with word2vec embeddings and CR-CNN
with log-loss, are still lower than the best FCM result. We believe that FCM
can also be improved with these task-specific enhancements, e.g. replacing the
word embeddings with the task-specific ones from Hashimoto et al. (2015)
increases the result to 83.7% (see §7.2 for details). We leave the application
of ranking-based loss to future work.


Finally, a concurrent work (Liu et al., 2015) proposes DepNN, which builds
representations for the dependency path (and its attached subtrees) between two
entities by applying recursive and convolutional neural networks successively.
Compared to their model, our FCM achieves comparable results. Of note, our FCM
and the RelEmb are also the most efficient models among all of the above
compositional models since they have linear time complexity with respect to the
dimension of embeddings.

7.1 Effects of the embedding sub-models 

We next investigate the effects of different types of features on FCM using
ablation tests on ACE 2005 (Table 5). We focus on FCM alone with the feature
templates of Table 2. Additionally, we show results of using only the head
embedding features from Nguyen and Grishman (2014) (HeadOnly). Not
surprisingly, the HeadOnly model performs poorly (F1 score = 14.30%), showing
the importance of our rich binary feature set. Among all the feature templates,
removing HeadEmb results in
the largest degradation. The second most im- 





Feature Set        Prec    Rec     F1
HeadOnly           31.67    9.24   14.30
FCM                69.17   56.73   62.33
-HeadEmb           66.06   47.00   54.92
-Context           70.89   55.27   62.11
-In-between        66.39   51.86   58.23
-On-path           69.23   53.97   60.66
FCM-EntityTypes    71.33   34.68   46.67


Table 5: Ablation test of FCM on development set. 

portant feature template is In-between, while 
Context features have little impact. Removing all entity type features
($t_{h_i}$) does significantly worse than the full model, showing the value of
our entity type features.

7.2 Effects of the word embeddings 

Good word embeddings are critical for both FCM and other compositional models.
In this section, we show the results of FCM with embeddings used to initialize
other recent state-of-the-art models. Those embeddings include the 300-d
baseline embeddings trained on English Wikipedia (w2v-enwiki-d300) and the
100-d task-specific embeddings (task-specific-d100) (footnote 10) from the
RelEmb paper (Hashimoto et al., 2015), and the 400-d embeddings from the CR-CNN
paper (dos Santos et al., 2015). Moreover, we list the best result (DepNN) in
Liu et al. (2015), which uses the same embeddings as ours. Table 6 shows the
effects of word embeddings on FCM and provides relative comparisons between FCM
and the other state-of-the-art models. We use the same hyperparameters and
number of iterations as in Table 4.

The results show that using different embeddings
to initialize FCM can improve F1 beyond
our previous results. We also find that increasing
the dimension of the word embeddings does
not necessarily lead to better results, due to the
problem of over-fitting (e.g., w2v-enwiki-d400 vs.
w2v-enwiki-d300). With the same initial embeddings,
FCM usually outperforms the competing model
without any changes to the hyperparameters,
further confirming the advantage of
FCM at the model level as discussed under
Table 4. The only exception is the DepNN model,
which gets a better result than FCM on the same
embeddings. The task-specific embeddings from
Hashimoto et al. (2015) lead to the best
performance (an improvement of 0.7%). This
observation suggests that the other compositional models
may also benefit from the work of Hashimoto et
al. (2015).

10 In the task-specific setting, FCM will represent entity
words and context words with separate sets of embeddings.

Embeddings            Model                     F1
w2v-enwiki-d300       RelEmb                    81.8
                      (2) FCM (log-bilinear)    83.4
task-specific-d100    RelEmb                    82.8
                      RelEmb+linear             83.5
                      (2) FCM (log-bilinear)    83.7
w2v-enwiki-d400       CR-CNN                    82.7
                      (2) FCM (log-bilinear)    83.0
w2v-nyt-d200          DepNN                     83.6
                      (2) FCM (log-bilinear)    83.0

Table 6: Evaluation of FCMs with different word
embeddings on SemEval 2010 Task 8.

8 Related Work 

Compositional Models for Sentences. In order
to build a representation (embedding) for a
sentence based on its component word embeddings
and structural information, recent work on
compositional models (stemming from the deep learning
community) has designed model structures that
mimic the structure of the input. For example,
these models could take into account the order of
the words (as in Convolutional Neural Networks
(CNNs)) (Collobert et al., 2011) or build off of
an input tree (as in Recursive Neural Networks
(RNNs) or the Semantic Matching Energy
Function) (Socher et al., 2013b; Bordes et al., 2012).

While these models work well on sentence-level
representations, the nature of their designs also
limits them to fixed types of substructures from the
annotated sentence, such as chains for CNNs and
trees for RNNs. Such models cannot capture
arbitrary combinations of the linguistic annotations
available for a given task, such as the word order,
dependency tree, and named entities used for relation
extraction. Moreover, these approaches ignore the
differences in function between words appearing
in different roles. This does not suit more general
substructure labeling tasks in NLP; e.g., these
models cannot be directly applied to relation extraction
since they will output the same result for any pair
of entities in the same sentence.

Compositional Models with Annotation
Features. To tackle the problem of traditional
compositional models, Socher et al. (2012) made the
RNN model specific to relation extraction tasks by
working on the minimal sub-tree which spans the
two target entities. However, this specialization
to relation extraction does not generalize easily to
other tasks in NLP. There are two ways to achieve
such specialization in a more general fashion:

1. Enhancing Compositional Models with
Features. A recent trend enhances compositional
models with annotation features. Such an
approach has been shown to significantly improve
over pure compositional models. For example,
Hermann et al. (2014) and Nguyen and Grishman
(2014) gave different weights to words with
different syntactic context types or to entity head
words with different argument IDs. Zeng et al.
(2014) used concatenations of embeddings as
features in a CNN model, according to their
positions relative to the target entity mentions.
Belinkov et al. (2014) enriched embeddings with
linguistic features before feeding them forward to an
RNN model. Socher et al. (2013a) and Hermann
and Blunsom (2013) enhanced RNN models by
refining the transformation matrices with phrase
types and CCG supertags.

2. Engineering of Embedding Features. A
different approach to combining traditional linguistic
features and embeddings is hand-engineering
features with word embeddings and adding them to
log-linear models. Such approaches have achieved
state-of-the-art results in many tasks including
NER, chunking, dependency parsing, semantic
role labeling, and relation extraction (Miller et al.,
2004; Turian et al., 2010; Koo et al., 2008; Roth
and Woodsend, 2014; Sun et al., 2011; Plank and
Moschitti, 2013). Roth and Woodsend (2014)
considered features similar to ours for semantic role
labeling.

However, in prior work both of the above
approaches are only able to utilize limited
information, usually one property for each word. Yet a
word may have several useful properties that
contribute to performance on the task. By
contrast, our FCM can easily utilize these features
without changing the model structure.

In order to better utilize dependency
annotations, recent work has built models according
to the dependency paths (Ma et al., 2015; Liu et
al., 2015), which shares a similar motivation with
the use of On-path features in our work.

Task-Specific Enhancements for Relation
Classification. An orthogonal direction for improving
compositional models for relation classification is
to enhance the models with task-specific
information. For example, Hashimoto et al. (2015) trained
task-specific word embeddings, and dos Santos et
al. (2015) proposed a ranking-based loss function
for relation classification.

9 Conclusion 

We have presented FCM, a new compositional
model for deriving sentence-level and substructure
embeddings from word embeddings.
Compared to existing compositional models, FCM can
easily handle arbitrary types of input and handle
global information for composition, while
remaining easy to implement. We have demonstrated
that FCM alone attains near state-of-the-art
performance on several relation extraction tasks, and
in combination with traditional feature-based
log-linear models it obtains state-of-the-art results.

Our next steps in improving FCM focus on
enhancements based on task-specific embeddings or
loss functions as in Hashimoto et al. (2015) and dos
Santos et al. (2015). Moreover, as the model
provides a general idea for representing both
sentences and sub-structures in language, it has the
potential to contribute useful components to
various tasks, such as dependency parsing, SRL and
paraphrasing. Also, as kindly pointed out by one
anonymous reviewer, our FCM can be applied to
the TAC-KBP (Ji et al., 2010) tasks by replacing
the training objective with a multi-instance
multi-label one (e.g. Surdeanu et al. (2012)). We plan to
explore the above applications of FCM in the
future.

Appendix 1: Experiments on ACE 2005
where Gold Entity Types Are Unknown

Experimental Settings: For comparison with
prior work (Plank and Moschitti, 2013), we (1)
generate relation instances from all pairs of
entities within each sentence with three or fewer
intervening entity mentions, labeling those pairs with
no relation as negative instances, (2) use gold
entity spans (but not types) at train and test time, and
(3) evaluate on the 7 coarse relation types,
ignoring the subtypes. In the training set, 35,990
relation instances are generated in total, of which only
3,658 are non-nil relations. We did not match the number of
tokens they reported in the cts and wl domains.
Therefore, in this section we only report the
results on the test set of the bc domain. We leave
experiments on additional domains to future work.
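
Step (1) above can be sketched as follows. This is an illustrative
sketch under stated assumptions rather than our actual preprocessing
code: the function names, the "NONE" label and the example relation
label are hypothetical, mentions are assumed to be listed in sentence
order, and each unordered pair is generated once (whether both
argument orders are enumerated is not specified here).

```python
from itertools import combinations


def candidate_pairs(mentions, max_intervening=3):
    """Return index pairs (i, j), i < j, with at most `max_intervening`
    entity mentions strictly between them in sentence order."""
    return [
        (i, j)
        for i, j in combinations(range(len(mentions)), 2)
        if j - i - 1 <= max_intervening
    ]


def label_pairs(mentions, gold_relations, max_intervening=3):
    """Attach a label to every candidate pair. Pairs absent from
    `gold_relations` (a dict keyed by index pair) become negative
    instances with the placeholder label "NONE"."""
    return [
        (mentions[i], mentions[j], gold_relations.get((i, j), "NONE"))
        for i, j in candidate_pairs(mentions, max_intervening)
    ]


if __name__ == "__main__":
    # Toy sentence with six entity mentions and one annotated relation.
    toy_mentions = ["m0", "m1", "m2", "m3", "m4", "m5"]
    toy_gold = {(0, 1): "REL_A"}  # hypothetical relation label
    for instance in label_pairs(toy_mentions, toy_gold):
        print(instance)
```

With six mentions, the only pair excluded is the first and last
mention, since four mentions lie between them.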

We run the same models as in §7 on this task.
Here the FCM does not use entity type features.
Plank and Moschitti (2013) also use Brown
clusters and word vectors learned by latent-semantic
analysis (LSA). In order to make a fair
comparison with their method, we also report the FCM
result using Brown clusters (prefixes of length
5) of entity heads as entity types. Furthermore,
we report non-comparable settings using
WordNet super-sense tags of entity heads as types. The
WordNet features were also used in their paper,
but not as a substitute for entity types. We use the
same toolkit to get the WordNet tags as in §6. The
Brown clusters are from Koo et al. (2008) 11 .
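
The substitution of Brown cluster prefixes for entity types can be
illustrated with a short sketch. Everything below is an assumption
made for illustration: the function name, the lowercasing of the head
word, the "UNK" fallback and the bit strings in the toy dictionary are
hypothetical and are not taken from the cluster file we used.

```python
def brown_entity_type(head_word, clusters, prefix_len=5, unknown="UNK"):
    """Map an entity head word to a pseudo entity type: the first
    `prefix_len` bits of its Brown cluster bit string.

    `clusters` maps lowercased words to bit strings such as "0110100".
    Words without a cluster receive the `unknown` placeholder type.
    """
    bits = clusters.get(head_word.lower())
    if bits is None:
        return unknown
    return bits[:prefix_len]


if __name__ == "__main__":
    # Toy cluster assignments; the bit strings are made up.
    toy_clusters = {
        "bank": "0110100111",
        "company": "0110101100",
        "president": "1011001",
    }
    for word in ["Bank", "company", "president", "zebra"]:
        print(word, brown_entity_type(word, toy_clusters))
```

In this toy example "Bank" and "company" share the prefix 01101 and
so receive the same pseudo type, which is the kind of coarse grouping
that lets cluster prefixes stand in for entity types.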

Results: Table 7 shows the results under the
low-resource setting. When no entity types are
available, the performance of our FCM only model
decreases greatly to 48.15%, which is
consistent with our observation in the ablation tests.
The baseline model also relies heavily on the
entity types. After we remove all the
hand-engineered features that contain entity type
information, the performance of our baseline model
drops to 40.62%, even lower than the reduced FCM
only model.

The combination of the baseline model and head
embeddings (Baseline + HeadOnly) greatly
improves the results. This is consistent with the
observation in Nguyen and Grishman (2014) that
when the gold entity types are unknown, the
information about the entity heads provided by their
embeddings plays a more important role. The
combination of the baseline and FCM (Baseline + FCM) also
achieves an improvement, but is not significantly better
than Baseline + HeadOnly. A possible
explanation is that FCM becomes less effective at using
context word embeddings when the entity type
information is unavailable. In this situation the head
embeddings provided by FCM become the
dominating contribution to the baseline model, making
the model behave similarly to the Baseline +
HeadOnly method.

11 http://people.csail.mit.edu/maestro/papers/bllip-clusters.gz

Finally, we find that Brown clusters can help FCM
when entity types are unknown. Although the
performance is still not significantly better than
Baseline + HeadOnly, it outperforms all the results in
Plank and Moschitti (2013) as a single model, and
with the same source of features. WordNet
super-sense tags further improve FCM, and achieve the
best reported results on this low-resource setting.
These results are encouraging since they suggest that FCM
may be more useful under the end-to-end setting,
where predictions of both entity mentions and
relation mentions are required, rather than predicting
relations based on gold tags (Li and Ji, 2014).

Recently, Nguyen et al. (2015) proposed a novel
way of applying embeddings to tree kernels. From
the results, our best single model achieves a result
comparable to that of their best single system, while
their combination method is slightly better than
ours. This suggests that we may benefit more from
combining multiple word representations;
we will investigate this in future work.


                                  bc
Model                             P       R       F1
PM’13 (Brown)                     54.4    43.4    48.3
PM’13 (LSA)                       53.9    45.2    49.2
PM’13 (Combination)               55.3    43.1    48.5
(1) FCM only                      53.7    43.7    48.2
(3) Baseline                      59.4    30.9    40.6
(4) + HeadOnly                    64.9    41.3    50.5
(5) + FCM                         65.5    41.5    50.8
(1) FCM only w/ Brown             64.6    40.2    49.6
(1) FCM only w/ WordNet           64.0    43.2    51.6
Linear+Emb                        46.5    49.3    47.8
Tree-kernel+Emb (Single)          57.6    46.6    51.5
Tree-kernel+Emb (Combination)     58.5    47.3    52.3


Table 7: Comparison of models on ACE 2005 out-of-domain
test sets for the low-resource setting, where the
gold entity spans are known but entity types are unknown.
PM’13 denotes the results reported in Plank and Moschitti (2013).
“Linear+Emb” is the implementation of our method (4) in
Nguyen et al. (2015). The “Tree-kernel+Emb” methods are
the enrichments of tree kernels with embeddings proposed by
Nguyen et al. (2015).






